This is the second part of the project “Predicting Crime Rates throughout the USA”; this HTML document focuses on the Violence target variable. Since most of the data-cleaning steps (e.g. checking for NAs, checking for class balance in the target variables, reviewing collinearity, etc.) follow the same procedure as before, they are skipped here to avoid repetition and the resulting changes are applied to the data directly. For more context and a detailed overview, please review the data-cleaning part of the Murder Analysis (part one of “Predicting Crime Rates throughout the USA”).
rm(list = ls())
library(readxl)
crime <- read_excel("/Users/felipereyes/iCloud Drive (Archiv)/Documents/UNI/complutense/TFM/DATA.CommViolPredUnnormalizedData 2.xlsx")
library(skimr)
library(dplyr)
library(tidyverse)
library(ggthemes)
library(corrplot)
library(fastDummies)
library(corrr)
library(Hmisc)
library(parallel)
library(doParallel)
library(tidymodels)
library(ggplot2)
library(gridExtra)
library(outliers)
library(themis)
library(rpart.plot)
library(performance)
library(glue)
library(ranger)
library(vip)
library(ggrepel)
library(Rmisc)
library(MASS)
library(MXM)
library(Boruta)
library(caret)
library(pROC)
library(visualpred)
library(car)
library(DataExplorer)
library(missForest)
library(plotly)
library(lubridate)
As seen in the Murder Analysis, many variables contained “?” instead of NA and were therefore read in as qualitative rather than quantitative. Let's extract the qualitative columns and replace “?” with NA.
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
library(dplyr)
qual_var <- crime %>% select_if(is.character)
qual_var <- qual_var %>% mutate_all(~na_if(.,"?"))
qual_var
# A tibble: 2,215 × 116
communityname state countyCode communityCode householdsize
<chr> <chr> <chr> <chr> <chr>
1 BerkeleyHeightstownsh… NJ 39 5320 3.1
2 Marpletownship PA 45 47616 2.82
3 Tigardcity OR <NA> <NA> 2.43
4 Gloversvillecity NY 35 29443 2.4
5 Bemidjicity MN 7 5068 2.76
6 Springfieldcity MO <NA> <NA> 2.45
7 Norwoodtown MA 21 50250 2.6
8 Andersoncity IN <NA> <NA> 2.45
9 Fargocity ND 17 25700 2.46
10 Wacocity TX <NA> <NA> 2.62
# ℹ 2,205 more rows
# ℹ 111 more variables: racepctblack <chr>, racePctWhite <chr>,
# racePctAsian <chr>, racePctHisp <chr>, agePct12t21 <chr>,
# agePct12t29 <chr>, agePct16t24 <chr>, agePct65up <chr>,
# pctUrban <chr>, pctWWage <chr>, pctWFarmSelf <chr>,
# pctWInvInc <chr>, pctWSocSec <chr>, pctWPubAsst <chr>,
# pctWRetire <chr>, PctPopUnderPov <chr>, PctLess9thGrade <chr>, …
Let's exclude communityname, state, countyCode and communityCode from the new dataset so that the remaining columns can be converted to numeric, and then put the qualitative variables back into the dataset, so that the original dataset “crime” keeps all variables.
Now that all columns are numeric, let's exclude the target variables we won't use, as well as the four qualitative variables that serve as community identifiers, and save the result into a new dataset called “crime2”, leaving “crime” untouched.
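The conversion itself was skipped above; a minimal sketch of what that step looks like, using a hypothetical mini-tibble (`toy`) in place of the real data — identifier columns stay character, the rest are coerced to numeric:

```r
library(dplyr)

# Hypothetical mini-example standing in for the real data
toy <- dplyr::tibble(
  communityname = c("BerkeleyHeightstownship", "Tigardcity"),
  householdsize = c("3.1", "2.43"),
  medIncome     = c("41979", "?")
)

# Replace "?" with NA, then coerce every non-identifier column to numeric
toy_num <- toy |>
  mutate(across(-communityname, ~ as.numeric(na_if(.x, "?"))))
```

On the real data the same `across()` call is applied to all columns except the four identifiers.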
Besides missing values, outliers also have to be reviewed.
Furthermore, after having corrected the dataset for outliers and NAs, we will create the binary target variables.
For the following sections, the variables will be divided into the categories identified previously in this research (Economic State/Wealth, Demographic Characteristics, Police-Related Factors and Family Structure), in order to get a better overview and understanding. At the end of this section, two datasets with the variables identified as most important will be created, one for each target variable (Violence and Murder).
Variables:
medIncome, medFamInc, perCapInc, whitePerCap, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, HispPerCap, pctWWage, pctWFarmSelf, pctWInvInc, pctWSocSec, pctWPubAsst, pctWRetire, OwnOccLowQuart, OwnOccMedVal, OwnOccHiQuart, OwnOccQrange, RentLowQ, RentMedian, RentHighQ, RentQrange, MedRent, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg
# Build one flipped box plot per economic variable
econ_vars <- c("medIncome", "medFamInc", "perCapInc", "whitePerCap", "blackPerCap",
               "indianPerCap", "AsianPerCap", "OtherPerCap", "HispPerCap", "pctWWage",
               "pctWFarmSelf", "pctWInvInc", "pctWSocSec", "pctWPubAsst", "pctWRetire",
               "OwnOccLowQuart", "OwnOccMedVal", "OwnOccHiQuart", "OwnOccQrange",
               "RentLowQ", "RentMedian", "RentHighQ", "RentQrange", "MedRent",
               "MedRentPctHousInc", "MedOwnCostPctInc", "MedOwnCostPctIncNoMtg")
econ_boxes <- lapply(econ_vars, function(v)
  ggplot(crime2, aes(ViolentCrimesPerPop, .data[[v]])) +
    geom_boxplot() +
    theme_minimal() +
    coord_flip())
multiplot(plotlist = econ_boxes[1:10], cols = 2)
multiplot(plotlist = econ_boxes[11:19], cols = 2)
multiplot(plotlist = econ_boxes[20:27], cols = 2)
Looking at these box plots, all the economic variables are asymmetrical, so outliers are detected; like missing values, they will be imputed with the median. The only two variables that do not seem to have outliers are RentHighQ and MedRent.
Variables:
population, householdsize, racepctblack, racePctWhite, racePctAsian, racePctHisp, agePct12t21, agePct12t29, agePct16t24, agePct65up, numbUrban, pctUrban, PctImmigRecent, PctImmigRec5, PctImmigRec8, PctImmigRec10, PctRecentImmig, PctRecImmig5, PctRecImmig8, PctRecImmig10, PctSpeakEnglOnly, PctNotSpeakEnglWell, PctForeignBorn, PctBornSameState, PctSameHouse85, PctSameCity85, PctSameState85
# Build one flipped box plot per demographic variable
demo_vars <- c("population", "householdsize", "racepctblack", "racePctAsian",
               "racePctHisp", "agePct12t21", "agePct12t29", "agePct16t24",
               "agePct65up", "numbUrban", "pctUrban", "PctImmigRecent",
               "PctImmigRec5", "PctImmigRec8", "PctImmigRec10", "PctRecentImmig",
               "PctRecImmig5", "PctRecImmig8", "PctRecImmig10", "PctSpeakEnglOnly",
               "PctNotSpeakEnglWell", "PctForeignBorn", "PctBornSameState",
               "PctSameHouse85", "PctSameCity85", "PctSameState85", "racePctWhite")
demo_boxes <- lapply(demo_vars, function(v)
  ggplot(crime2, aes(ViolentCrimesPerPop, .data[[v]])) +
    geom_boxplot() +
    theme_minimal() +
    coord_flip())
multiplot(plotlist = demo_boxes[1:10], cols = 2)
multiplot(plotlist = demo_boxes[11:19], cols = 2)
multiplot(plotlist = demo_boxes[20:27], cols = 2)
Once again, almost all variables have outliers. pctUrban, instead of outliers, has a very high number of values at the extremes of its distribution. The median sits at the right extreme, which indicates that the distribution is heavily skewed, since many values are concentrated there. In general, most variables are strongly skewed to the left or right and have a high number of outliers.
# Build one flipped box plot per police-related variable
police_vars <- c("LemasSwornFT", "LemasSwFTPerPop", "LemasSwFTFieldOps",
                 "LemasSwFTFieldPerPop", "LemasTotalReq", "LemasTotReqPerPop",
                 "PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol",
                 "PctPolicWhite", "PctPolicBlack", "PctPolicHisp", "PctPolicAsian",
                 "PctPolicMinor", "OfficAssgnDrugUnits", "NumKindsDrugsSeiz",
                 "PolicAveOTWorked", "LandArea", "PopDens", "PctUsePubTrans",
                 "PolicCars", "PolicOperBudg", "LemasPctPolicOnPatr",
                 "LemasGangUnitDeploy", "LemasPctOfficDrugUn", "PolicBudgPerPop")
police_boxes <- lapply(police_vars, function(v)
  ggplot(crime2, aes(ViolentCrimesPerPop, .data[[v]])) +
    geom_boxplot() +
    theme_minimal() +
    coord_flip())
multiplot(plotlist = police_boxes[1:10], cols = 2)
multiplot(plotlist = police_boxes[11:19], cols = 2)
multiplot(plotlist = police_boxes[20:26], cols = 2)
Again, there are outliers present in almost all of the variables. The only one without outliers is LemasGangUnitDeploy, which nevertheless has the most interesting distribution, since its variability is exceptionally high.
Variables:
MalePctDivorce, MalePctNevMarr, FemalePctDiv, TotalPctDiv, PersPerFam, PctFam2Par, PctKids2Par, PctYoungKids2Par, PctTeen2Par, PctWorkMomYoungKids, PctWorkMom, NumKidsBornNeverMar, PctKidsBornNeverMar
# Build one flipped box plot per family-structure variable
fam_vars <- c("MalePctDivorce", "MalePctNevMarr", "FemalePctDiv", "TotalPctDiv",
              "PersPerFam", "PctFam2Par", "PctKids2Par", "PctYoungKids2Par",
              "PctTeen2Par", "PctWorkMomYoungKids", "PctWorkMom",
              "NumKidsBornNeverMar", "PctKidsBornNeverMar")
fam_boxes <- lapply(fam_vars, function(v)
  ggplot(crime2, aes(ViolentCrimesPerPop, .data[[v]])) +
    geom_boxplot() +
    theme_minimal() +
    coord_flip())
multiplot(plotlist = fam_boxes[1:7], cols = 2)
multiplot(plotlist = fam_boxes[8:13], cols = 2)
It can be observed that the only variable without outliers is TotalPctDiv. It is also distributed evenly, as its median is centered and its whiskers have similar lengths. FemalePctDiv also seems to be evenly distributed, despite an outlier at the extreme right.
Now that we have identified NAs and tested for outliers, we can conclude that almost all variables have outliers, so these have to be corrected. Outliers will be detected using z-scores with a threshold of 2.5 and then imputed, as mentioned above, with the median. In the same step, the NAs present in the variables will be fixed as well.
crime2 <-
  crime2 |>
  # Detect outliers (|z| > 2.5) and label them as NAs
  dplyr::mutate(across(where(is.numeric), function(x) {
    z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
    ifelse(!is.na(z) & abs(z) > 2.5, NA, x)
  })) |>
  # Impute NAs: median for numeric columns, most frequent value for character columns
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) |>
  mutate(across(where(is.character), ~ ifelse(is.na(.), names(which.max(table(.))), .)))
As already seen in the Murder Analysis, both target variables are skewed to the right. The binary variables are created by splitting at a reference value: the most frequent value of ViolentCrimesPerPop for violence, and the mean of murdPerPop for murder.
# Most frequent value (statistical mode), used as the split point for violence
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
crime2_v <- crime2 %>% select(-c(ViolentCrimesPerPop, murdPerPop))
crime2_m <- crime2 %>% select(-c(ViolentCrimesPerPop, murdPerPop))
split_v <- stat_mode(crime2$ViolentCrimesPerPop)
crime2_v$violence <- ifelse(crime2$ViolentCrimesPerPop > split_v, 1, 0)
split_m <- mean(crime2$murdPerPop)
crime2_m$murder <- ifelse(crime2$murdPerPop > split_m, 1, 0)
crime2 <- crime2 %>% select(-c(ViolentCrimesPerPop, murdPerPop))
variable <- crime2_v %>% select(violence)
variable2 <- crime2_m %>% select(murder)
crime2 <- cbind(crime2, variable, variable2)
## crime2_v is the dataset with all variables plus the violence target variable
## crime2_m is the dataset with all variables plus the murder target variable
## crime2 is the dataset with all variables plus both target variables
We created the two datasets for the two target variables: crime2_v and crime2_m.
The balances of the values in the single target variables, as seen in the Murder Analysis, are the following:
Violence:
| Violence | Accuracy Base | Error Rate Base |
|---|---|---|
| 0 | 54.99 % | - |
| 1 | - | 45.01 % |
Murder:
| Murder | Accuracy Base | Error Rate Base |
|---|---|---|
| 0 | 65.64% | - |
| 1 | - | 34.36% |
The class balance for murder is worse, so oversampling has to be done.
library(ROSE)
# Perform random oversampling
set.seed(1234)
crime2_m_balanced <- ovun.sample(factor(murder) ~ ., data = crime2_m, method = "over", N = 2 * table(crime2_m$murder)[1])$data
table(crime2_m_balanced$murder)
0 1
1454 1454
# murder
table_hd <- table(crime2_m_balanced$murder)
base_error_rate <- round((table_hd[2]/sum(table_hd))*100, 2)
cat("Reference base error rate:", base_error_rate, "%\n")
Reference base error rate: 50 %
base_accuracy <- round((table_hd[1]/sum(table_hd))*100, 2)
cat("Base accuracy:", base_accuracy, "%\n")
Base accuracy: 50 %
sum(crime2_m_balanced$murder == 0)
[1] 1454
sum(crime2_m_balanced$murder == 1)
[1] 1454
crime2_m <- crime2_m_balanced
It can be seen that both target variables are now well balanced.
Now that we have fixed outliers and NAs and have our target variables ready, we will correct for collinearity and also check how the target variable is represented within the independent variables. Similar to the outliers section, we will divide this section into the different variable groups.
For collinearity, the heat maps can be found in the Murder Analysis. Here we will just select the variables we decided to include in the final dataset. An absolute correlation of 0.5 or above can be considered high, and at least one of the affected variables must be excluded. The representation, however, is unique to our target variable and will therefore be visualized and discussed in this section.
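As a quick way to verify the 0.5 rule on any set of columns, one can list the variable pairs whose absolute correlation reaches the threshold. A minimal base-R sketch, illustrated on the built-in mtcars data rather than the crime data:

```r
# List all variable pairs with |r| >= 0.5 (illustrated on mtcars)
cm  <- cor(mtcars)
idx <- which(abs(cm) >= 0.5 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, 1]],
  var2 = colnames(cm)[idx[, 2]],
  r    = round(cm[idx], 2)
)
# Strongest correlations first
high_pairs[order(-abs(high_pairs$r)), ]
```

Applied to the crime variables, each listed pair would flag at least one candidate for exclusion.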
Adding the chosen variables to the new dataset after the collinearity check done in the Murder analysis:
crime2_v <- crime2_v |> select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg, violence)
library(ggtext)
age_freq <- table(crime2_v$perCapInc, crime2_v$violence)
age_freq <- as.data.frame(age_freq)
colnames(age_freq) <- c("perCapInc", "violence", "FRECUENCIA")
# One box plot of each retained economic variable, split by the violence class
econ_kept <- c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
               "OtherPerCap", "pctWWage", "MedRentPctHousInc",
               "MedOwnCostPctInc", "MedOwnCostPctIncNoMtg")
econ_plots <- lapply(econ_kept, function(v)
  crime2_v |> ggplot(aes(x = violence, y = .data[[v]], fill = factor(violence))) +
    geom_boxplot(alpha = 0.3) +
    coord_flip() + labs(y = "", title = paste("Violence in", v)))
multiplot(plotlist = econ_plots[1:4], cols = 2)
multiplot(plotlist = econ_plots[5:9], cols = 2)
We can see that in each variable the two violence classes differ only slightly in their distribution, since the box plots for 0 and 1 are highly similar. For perCapInc, however, the variance is higher for the values belonging to class 0 and more outliers are present. The same holds for AsianPerCap and pctWWage.
Adding the chosen variables to the new dataset after the collinearity check done in the Murder analysis:
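The selection step for this group does not appear in this excerpt; mirroring the pattern used later for the police-related variables, it presumably looks like the following sketch (assuming the demographic variables kept after the collinearity check are exactly the ones plotted in this section):

```r
# Add the chosen demographic variables from crime2 to crime2_v
# (sketch; mirrors the cbind pattern used for the police-related block)
new_var <- crime2 |>
  dplyr::select(population, racepctblack, racePctWhite, agePct12t29,
                PctImmigRecent, PctSpeakEnglOnly, PctForeignBorn,
                PctBornSameState, PctSameHouse85)
crime2_v <- cbind(new_var, crime2_v)
```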
# One box plot of each retained demographic variable, split by the violence class
demo_kept <- c("population", "racepctblack", "racePctWhite", "agePct12t29",
               "PctImmigRecent", "PctSpeakEnglOnly", "PctForeignBorn",
               "PctBornSameState", "PctSameHouse85")
demo_plots <- lapply(demo_kept, function(v)
  crime2_v |> ggplot(aes(x = violence, y = .data[[v]], fill = factor(violence))) +
    geom_boxplot(alpha = 0.3) +
    coord_flip() + labs(y = "", title = paste("Violence in", v)))
multiplot(plotlist = demo_plots[1:4], cols = 2)
multiplot(plotlist = demo_plots[5:9], cols = 2)
Looking at the box plots, it can be observed that in most cases there is a difference in median, variance and outliers between the two classes of the target variable. For racepctblack, for example, the higher median for class 1 indicates that communities with a higher share of African Americans tend to have above-average violence rates. The opposite holds for racePctWhite, where, despite the values being more clustered, the median percentage of Caucasians in below-average-violence communities is close to 100.
Adding the chosen variables to the new dataset after the collinearity check done in the Murder analysis:
new_var <- crime2 |> select(PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn)
crime2_v <- cbind(new_var, crime2_v)
Violence:
# One box plot of each retained police-related variable, split by the violence class
police_kept <- c("PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol",
                 "PctPolicWhite", "PctPolicAsian", "PctPolicMinor",
                 "NumKindsDrugsSeiz", "PolicAveOTWorked", "LandArea",
                 "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
                 "LemasGangUnitDeploy", "LemasPctOfficDrugUn")
police_plots <- lapply(police_kept, function(v)
  crime2_v |> ggplot(aes(x = violence, y = .data[[v]], fill = factor(violence))) +
    geom_boxplot(alpha = 0.3) +
    coord_flip() + labs(y = "", title = paste("Violence in", v)))
multiplot(plotlist = police_plots[1:4], cols = 2)
multiplot(plotlist = police_plots[5:9], cols = 2)
multiplot(plotlist = police_plots[10:14], cols = 2)
Unfortunately, most box plots turn out to be highly clustered and very similar across the two classes of the target variable. This indicates little predictive power, which can be attributed to the high number of NAs found in most of these variables. It highlights once more the importance of having eliminated most of them during the collinearity check.
Adding the chosen variables to the new dataset after the collinearity check done in the Murder analysis:
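As with the demographic group, the selection step itself is not shown in this excerpt; following the same pattern, it presumably looks like this sketch (assuming the family-structure variables kept after the collinearity check are exactly the ones plotted below):

```r
# Add the chosen family-structure variables from crime2 to crime2_v
# (sketch; mirrors the cbind pattern used for the police-related block)
new_var <- crime2 |>
  dplyr::select(MalePctNevMarr, TotalPctDiv, PersPerFam,
                PctKids2Par, PctWorkMom)
crime2_v <- cbind(new_var, crime2_v)
```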
Violence:
# One box plot of each retained family-structure variable, split by the violence class
fam_kept <- c("MalePctNevMarr", "TotalPctDiv", "PersPerFam",
              "PctKids2Par", "PctWorkMom")
fam_plots <- lapply(fam_kept, function(v)
  crime2_v |> ggplot(aes(x = violence, y = .data[[v]], fill = factor(violence))) +
    geom_boxplot(alpha = 0.3) +
    coord_flip() + labs(y = "", title = paste("Violence in", v)))
multiplot(plotlist = fam_plots, cols = 2)
In contrast to the police-related variables, here the distribution of the target classes within the independent variables is clearly visible. TotalPctDiv (total percentage of divorced people), for example, tends to have its higher values in violence = 1, which suggests that many above-average dangerous communities have a higher divorce rate.
Lastly, based on the heat map produced in the Murder Analysis, we will exclude the variables that still show high collinearity across the four groups.
cor_matrix <-
crime2 |>
dplyr::select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctIncNoMtg, racepctblack, racePctWhite, agePct12t29, PctImmigRecent, PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn, TotalPctDiv, PersPerFam, PctWorkMom) |>
cor() |>
round(2)
filtered_matrix <- abs(cor_matrix)
# Blank out all pairs below the 0.5 threshold (in absolute value)
filtered_matrix[filtered_matrix < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
We can see that no pair of variables included in the new dataset has an absolute correlation of 0.5 or above.
Now we have to transform the numerical variables of different scales (except the target variable) to a common scale to better analyze and draw comparisons among them.
listconti= c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", "OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", "racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", "PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", "PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", "PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", "LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", "LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", "PersPerFam", "PctWorkMom"
)
vardep<-"violence"
means <-apply(crime2_v[,listconti],2,mean,na.rm=TRUE)
sds<-sapply(crime2_v[,listconti],sd,na.rm=TRUE)
crime_v_2<-scale(crime2_v[,listconti], center = means, scale = sds)
crime_v_final <- data.frame(crime_v_2)
new_var <- crime2 |> select(violence)
crime_v_final <- cbind(new_var, crime_v_final)
c("violence", "perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked",
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv",
"PersPerFam", "PctWorkMom")
vardep= c("violence")
variable_indepe = c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", "OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", "racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", "PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", "PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", "PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", "LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", "LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", "PersPerFam", "PctWorkMom"
)
y<-archivo_v[,vardep]
x<-archivo_v[,variable_indepe]
Before we continue, we will set up parallel computing: we create a cluster that uses all of the device's cores except one.
GS_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
From this section onwards, the code, visualizations and results are unique to the Violence Analysis; there is no overlap with the first part of the “Predicting Crime Rates throughout the USA” project.
We continue with the selection of variables, using different methods:
set.seed(123)
full<-glm(violence~.,data=archivo_v,family = binomial(link="logit"))
null<-glm(violence~1,data=archivo_v,family = binomial(link="logit"))
selec1<-stepAIC(null,scope=list(upper=full),
direction="both",family = binomial(link="logit"),trace=FALSE)
vec<-(names(selec1[[1]]))
length(vec)
[1] 13
dput(vec)
c("(Intercept)", "racePctWhite", "TotalPctDiv", "blackPerCap",
"PctSpeakEnglOnly", "LemasPctOfficDrugUn", "pctWWage", "PctSameHouse85",
"indianPerCap", "racepctblack", "LandArea", "LemasPctPolicOnPatr",
"PolicReqPerOffic")
AIC_list <- c("racePctWhite", "TotalPctDiv", "blackPerCap",
"PctSpeakEnglOnly", "LemasPctOfficDrugUn", "pctWWage", "PctSameHouse85",
"indianPerCap", "racepctblack", "LandArea", "LemasPctPolicOnPatr",
"PolicReqPerOffic")
We calculate k with the formula k = log10(n): k = log10(2215) ≈ 3.345
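A quick check of this penalty value in base R. Note that R's `log()` is the natural logarithm, so the 3.345 used here corresponds to `log10()`; the classical BIC penalty would be the natural logarithm, log(2215) ≈ 7.70.

```r
n <- 2215
log(n)              # natural log, the classical BIC penalty: ~7.70
log10(n)            # base-10 log, the value used in this analysis: ~3.345
round(log10(n), 3)
```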
set.seed(123)
full<-glm(violence~.,data=archivo_v,family = binomial(link="logit"))
null<-glm(violence~1,data=archivo_v,family = binomial(link="logit"))
selec1<-stepAIC(null,scope=list(upper=full),
direction="both",family = binomial(link="logit"),trace=FALSE,k=3.345)
vec<-(names(selec1[[1]]))
length(vec)
[1] 8
dput(vec)
c("(Intercept)", "racePctWhite", "TotalPctDiv", "blackPerCap",
"PctSpeakEnglOnly", "LemasPctOfficDrugUn", "pctWWage", "PctSameHouse85"
)
BIC_list <- c("racePctWhite", "TotalPctDiv", "blackPerCap",
"PctSpeakEnglOnly", "LemasPctOfficDrugUn", "pctWWage", "PctSameHouse85"
)
Boruta performed 20 iterations in 53.81288 secs.
33 attributes confirmed important: agePct12t29, AsianPerCap,
blackPerCap, indianPerCap, LandArea and 28 more;
No attributes deemed unimportant.
summary(out.boruta)
Length Class Mode
finalDecision 33 factor numeric
ImpHistory 720 -none- numeric
pValue 1 -none- numeric
maxRuns 1 -none- numeric
light 1 -none- logical
mcAdj 1 -none- logical
timeTaken 1 difftime numeric
roughfixed 1 -none- logical
call 3 -none- call
impSource 1 -none- character
sal<-data.frame(out.boruta$finalDecision)
sal2<-sal[which(sal$out.boruta.finalDecision=="Confirmed"),,drop=FALSE]
dput(row.names(sal2))
c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked",
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv",
"PersPerFam", "PctWorkMom")
Boruta_list <-
c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked",
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv",
"PersPerFam", "PctWorkMom")
set.seed(123)
mmpc1 <- MMPC(target = vardep, dataset=crime_v_final, max_k = 3, hash = TRUE,
test = "testIndLogistic")
mmpc1@selectedVars
[1] 3 8 11 31 32 34
c("blackPerCap", "MedRentPctHousInc", "racePctWhite", "LemasPctOfficDrugUn",
"TotalPctDiv", "PctWorkMom")
length(a)
[1] 6
MXM_list <- c("blackPerCap", "MedRentPctHousInc", "racePctWhite", "LemasPctOfficDrugUn",
"TotalPctDiv", "PctWorkMom")
set.seed(123)
SES1 <- SES(vardep, crime_v_final, max_k = 3, hash = TRUE,
test = "testIndLogistic")
SES1@selectedVars
[1] 3 8 11 31 32 34
c("blackPerCap", "MedRentPctHousInc", "racePctWhite", "LemasPctOfficDrugUn",
"TotalPctDiv", "PctWorkMom")
c("blackPerCap", "MedRentPctHousInc", "racePctWhite", "LemasPctOfficDrugUn",
"TotalPctDiv", "PctWorkMom")
length(a)
[1] 6
SES_list <- c("blackPerCap", "MedRentPctHousInc", "racePctWhite", "LemasPctOfficDrugUn",
"TotalPctDiv", "PctWorkMom")
set.seed(123)
filtro<-sbf(x,y,sbfControl =
sbfControl(functions = rfSBF,
method = "cv", verbose = FALSE))
a<-dput(filtro$optVariables)
c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "racepctblack",
"racePctWhite", "agePct12t29", "PctImmigRecent", "PctSpeakEnglOnly",
"PctForeignBorn", "PctBornSameState", "PctSameHouse85", "PolicReqPerOffic",
"RacialMatchCommPol", "PctPolicWhite", "PctPolicAsian", "PctPolicMinor",
"LandArea", "PctUsePubTrans", "LemasPctPolicOnPatr", "LemasGangUnitDeploy",
"LemasPctOfficDrugUn", "TotalPctDiv", "PersPerFam", "PctWorkMom"
)
length(a)
[1] 28
SBF_list <- c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "racepctblack",
"racePctWhite", "agePct12t29", "PctImmigRecent", "PctSpeakEnglOnly",
"PctForeignBorn", "PctBornSameState", "PctSameHouse85", "PolicReqPerOffic",
"RacialMatchCommPol", "PctPolicWhite", "PctPolicAsian", "PctPolicMinor",
"LandArea", "PctUsePubTrans", "LemasPctPolicOnPatr", "LemasGangUnitDeploy",
"LemasPctOfficDrugUn", "TotalPctDiv", "PersPerFam", "PctWorkMom"
)
set.seed(123)
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(x, y, sizes=c(1:8), rfeControl=control)
selecrfe<-results$optVariables
length(selecrfe)
[1] 33
dput(selecrfe)
c("racePctWhite", "TotalPctDiv", "racepctblack", "blackPerCap",
"PctSpeakEnglOnly", "MedRentPctHousInc", "pctWWage", "PctForeignBorn",
"perCapInc", "agePct12t29", "PersPerFam", "PctSameHouse85", "PctWorkMom",
"OtherPerCap", "PctImmigRecent", "PctBornSameState", "AsianPerCap",
"PctUsePubTrans", "LandArea", "PolicReqPerOffic", "PctPolicMinor",
"indianPerCap", "LemasPctOfficDrugUn", "PctPolicWhite", "MedOwnCostPctIncNoMtg",
"NumKindsDrugsSeiz", "RacialMatchCommPol", "PolicOperBudg", "LemasGangUnitDeploy",
"PolicPerPop", "LemasPctPolicOnPatr", "PctPolicAsian", "PolicAveOTWorked"
)
RFE_list <- c("racePctWhite", "TotalPctDiv", "racepctblack", "blackPerCap",
"PctSpeakEnglOnly", "MedRentPctHousInc", "pctWWage", "perCapInc",
"PctForeignBorn", "agePct12t29", "PersPerFam", "PctSameHouse85",
"PctWorkMom", "OtherPerCap", "PctImmigRecent", "PctBornSameState",
"AsianPerCap", "PctUsePubTrans", "LandArea", "PolicReqPerOffic",
"PctPolicMinor", "indianPerCap", "LemasPctOfficDrugUn", "MedOwnCostPctIncNoMtg",
"PctPolicWhite", "NumKindsDrugsSeiz", "RacialMatchCommPol", "PolicOperBudg",
"LemasGangUnitDeploy", "PolicPerPop", "LemasPctPolicOnPatr",
"PctPolicAsian", "PolicAveOTWorked")
Let’s compare the algorithms with their different numbers of variables chosen.
crime_v_final2 <- crime_v_final
crime_v_final2$violence<-ifelse(crime_v_final2$violence==1,"Yes","No")
crime_v_final2$violence<-factor(crime_v_final2$violence)
crime_v_final2 <- na.omit(crime_v_final2 )
crime_v_final2 <-as_tibble(crime_v_final2 )
crime_v_final2 <- as.data.frame(crime_v_final2 )
data2<- crime_v_final2
medias1<-cruzadalogistica(data=data2,
vardep="violence",listconti=AIC_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias1 <- as.data.frame(medias1[[1]])
medias1$modelo="AIC"
medias2<-cruzadalogistica(data=data2,
vardep="violence",listconti=BIC_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias2 <- as.data.frame(medias2[[1]])
medias2$modelo="BIC"
medias3<-cruzadalogistica(data=data2,
vardep="violence",listconti=
Boruta_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias3 <- as.data.frame(medias3[[1]])
medias3$modelo="Boruta"
medias4<-cruzadalogistica(data=data2,
vardep="violence",listconti=RFE_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias4 <- as.data.frame(medias4[[1]])
medias4$modelo="RFE"
medias5<-cruzadalogistica(data=data2,
vardep="violence",listconti=
SBF_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias5 <- as.data.frame(medias5[[1]])
medias5$modelo="SBF"
medias6<-cruzadalogistica(data=data2,
vardep="violence",listconti=
MXM_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias6 <- as.data.frame(medias6[[1]])
medias6$modelo="MXM"
medias7<-cruzadalogistica(data=data2,
vardep="violence",listconti=
SES_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias7 <- as.data.frame(medias7[[1]])
medias7$modelo="SES"
union1<-rbind(medias1,medias2, medias3,
medias4,medias5,
medias6, medias7)
par(cex.axis=0.7)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
In terms of AUC, AIC, BIC, SES and MXM have very similar scores. SES and MXM selected exactly the same variables, so their Error Rates are identical as well. BIC, however, has the lowest Error Rate, so its variables will be chosen.
We therefore choose the BIC variables: c(“racePctWhite”, “TotalPctDiv”, “blackPerCap”, “PctSpeakEnglOnly”, “LemasPctOfficDrugUn”, “pctWWage”, “PctSameHouse85”)
| Model | Num. Variables | Variables |
|---|---|---|
| AIC | 12 | c(“racePctWhite”, “TotalPctDiv”, “blackPerCap”, “PctSpeakEnglOnly”, “LemasPctOfficDrugUn”, “pctWWage”, “PctSameHouse85”, “indianPerCap”, “racepctblack”, “LandArea”,“LemasPctPolicOnPatr”, “PolicReqPerOffic”) |
| Wrapper Boruta | 33 | c(“perCapInc”, “blackPerCap”, “indianPerCap”, “AsianPerCap”, “OtherPerCap”, “pctWWage”, “MedRentPctHousInc”, “MedOwnCostPctIncNoMtg”, “racepctblack”, “racePctWhite”,“agePct12t29”, “PctImmigRecent”, “PctSpeakEnglOnly”, “PctForeignBorn”, “PctBornSameState”, “PctSameHouse85”, “PolicReqPerOffic”, “PolicPerPop”, “RacialMatchCommPol”,“PctPolicWhite”, “PctPolicAsian”, “PctPolicMinor”, “NumKindsDrugsSeiz”, “PolicAveOTWorked”, “LandArea”, “PctUsePubTrans”, “PolicOperBudg”, “LemasPctPolicOnPatr”,“LemasGangUnitDeploy”, “LemasPctOfficDrugUn”, “TotalPctDiv”, “PersPerFam”, “PctWorkMom”) |
| MXM | 6 | c(“blackPerCap”, “MedRentPctHousInc”, “racePctWhite”, “LemasPctOfficDrugUn”, “TotalPctDiv”, “PctWorkMom”) |
| Wrapper SES | 6 | c(“blackPerCap”, “MedRentPctHousInc”, “racePctWhite”, “LemasPctOfficDrugUn”, “TotalPctDiv”, “PctWorkMom”) |
| SBF | 28 | c(“perCapInc”, “blackPerCap”, “indianPerCap”, “AsianPerCap”, “OtherPerCap”, “pctWWage”, “MedRentPctHousInc”, “racepctblack”, “racePctWhite”, “agePct12t29”,“PctImmigRecent”, “PctSpeakEnglOnly”, “PctForeignBorn”, “PctBornSameState”, “PctSameHouse85”, “PolicReqPerOffic”, “RacialMatchCommPol”,“PctPolicWhite”,“PctPolicAsian”, “PctPolicMinor”, “LandArea”, “PctUsePubTrans”, “LemasPctPolicOnPatr”, “LemasGangUnitDeploy”, “LemasPctOfficDrugUn”, “TotalPctDiv”, “PersPerFam”,“PctWorkMom”) |
| RFE | 33 | c(“racePctWhite”, “TotalPctDiv”, “racepctblack”, “blackPerCap”, “PctSpeakEnglOnly”, “MedRentPctHousInc”, “pctWWage”, “perCapInc”, “PctForeignBorn”, “agePct12t29”, “PersPerFam”, “PctSameHouse85”, “PctWorkMom”, “OtherPerCap”, “PctImmigRecent”, “PctBornSameState”, “AsianPerCap”, “PctUsePubTrans”, “LandArea”, “PolicReqPerOffic”, “PctPolicMinor”, “indianPerCap”, “LemasPctOfficDrugUn”, “MedOwnCostPctIncNoMtg”, “PctPolicWhite”, “NumKindsDrugsSeiz”, “RacialMatchCommPol”, “PolicOperBudg”,“LemasGangUnitDeploy”, “PolicPerPop”, “LemasPctPolicOnPatr”, “PctPolicAsian”, “PolicAveOTWorked”) |
| BIC | 7 | c(“racePctWhite”, “TotalPctDiv”, “blackPerCap”, “PctSpeakEnglOnly”, “LemasPctOfficDrugUn”, “pctWWage”, “PctSameHouse85”) |
Now that we have decided which variables to go with, let’s look for the best parameters of the network, first without and then with early stopping. First of all it is important to estimate the number of hidden nodes with the formula: number of parameters = h(k+1)+h+1, where h is the number of hidden nodes and k the number of inputs. We have 2215 observations and 7 predictors, and since we want about 30 observations per parameter, we can afford at most 2215/30 ≈ 74 parameters.
With 5 hidden nodes, we will need at least 1380 observations:
5(7+1)+5+1 = 46 -> 46x30 = 1380
With 7 nodes it would be 64 parameters, i.e. 1920 observations.
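The sizing rule above can be sketched in base R (here k = 7, the size of the BIC variable set; the helper names are ours, purely for illustration):

```r
# Parameters of a single-hidden-layer network with k inputs and h hidden nodes:
# h*(k+1) weights and biases into the hidden layer, plus h+1 into the output.
n_params   <- function(h, k) h * (k + 1) + h + 1
obs_needed <- function(h, k, per_param = 30) per_param * n_params(h, k)

k <- 7  # predictors in the BIC set
for (h in c(5, 7)) {
  cat("h =", h, "->", n_params(h, k), "parameters,",
      obs_needed(h, k), "observations needed\n")
}
```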
Hopefully we will find an adequate number of nodes to avoid over-fitting (too many nodes) or under-fitting (too few nodes). In the next section we will tune our network with repeated cross-validation using avNNet.
control<-trainControl(method = "repeatedcv",
number=4,savePredictions = "all"
,classProbs = TRUE, repeats=4,
summaryFunction=twoClassSummary
)
set.seed(123)
nnetgrid <- expand.grid(size=c(5, 6,8, 10),decay=c(0.1, 0.01, 0.001, 0.0001),bag=F)
completo<-data.frame()
listaiter<-c(10, 20, 50, 100, 200)
for (iter in listaiter)
{
rednnet<- train(violence~.,
data=data2,
method="avNNet",linout = FALSE,maxit=iter,
trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F, metric='ROC')
rednnet$results$itera<-iter
completo<-rbind(completo,rednnet$results)
}
completo<-completo[order(completo$ROC),]
ggplot(completo, aes(x=factor(itera), y=ROC,
color=factor(decay),pch=factor(size))) +
geom_point(position=position_dodge(width=0.5),size=3)
We want a ROC as high as possible with as few iterations as possible. We see that the curve flattens between 20 and 50 iterations, so let’s concentrate on this range. We will also try higher decay values, since 0.1 seems to be the best so far.
control<-trainControl(method = "repeatedcv",
number=4,savePredictions = "all"
,classProbs = TRUE, repeats=4,
summaryFunction=twoClassSummary
)
set.seed(123)
nnetgrid <- expand.grid(size=c(5, 6,8, 10),decay=c(0.2, 0.15, 0.1, 0.01),bag=F)
completo<-data.frame()
listaiter<-c(20, 25, 30, 35, 40, 45, 50)
for (iter in listaiter)
{
rednnet<- train(violence~.,
data=data2,
method="avNNet",linout = FALSE,maxit=iter,
trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F, metric='ROC')
rednnet$results$itera<-iter
completo<-rbind(completo,rednnet$results)
}
completo<-completo[order(completo$ROC),]
ggplot(completo, aes(x=factor(itera), y=ROC,
color=factor(decay),pch=factor(size))) +
geom_point(position=position_dodge(width=0.5),size=3)
The best number of iterations is 40, the best decay is 0.15 and the best number of nodes is 8. We keep these settings and use early stopping so that the network does not overfit. But before comparing it to the previous models, we will check whether we get better performance with the same parameters and maxit = 100.
set.seed(123)
nnetgrid <- expand.grid(size=c(8),decay=c(0.15),bag=F)
completo<-data.frame()
listaiter<-c(40)
for (iter in listaiter)
{
rednnet<- train(violence~.,
data=data2,
method="avNNet",linout = FALSE,maxit=100,
trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F)
rednnet$results$itera<-iter
completo<-rbind(completo,rednnet$results)
}
completo<-completo[order(completo$ROC),]
ggplot(completo, aes(x=factor(itera), y=ROC,
color=factor(decay),pch=factor(size))) +
geom_point(position=position_dodge(width=0.5),size=3)
The ROC is not higher with maxit = 100, so we will keep the network at 40 iterations.
data2<-crime_v_final2[,c(BIC_list, "violence")]
medias8<-cruzadaavnnetbin(data=data2, vardep="violence",
listconti=BIC_list, listclass=c(""),
grupos=4, sinicio=1234, repe=25,
repeticiones=5, itera=40, size=c(8),
decay=c(0.15))
size decay bag Accuracy Kappa AccuracySD KappaSD
1 8 0.15 FALSE 0.8492429 0.6432629 0.01207127 0.02881791
medias8 <- as.data.frame(medias8[[1]])
medias8$modelo="Network"
union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8)
par(cex.axis=0.8)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
Thanks to the tuning, the neural network is now the best model.
We will continue with the Bagging algorithm and tune it to find the optimal sampsize and ntree, using repeated cross-validation with caret. We will set mtry equal to the number of independent variables obtained from the BIC model, so that every variable is a candidate at each split, which is what reduces a random forest to bagging. In addition to Accuracy, we will also compare the mean Kappa. Kappa is important because it measures the agreement between the predicted and the observed classes after discounting the agreement expected by chance; a high Kappa therefore protects us from conclusions driven by chance agreement. Ideally, Kappa should be above 0.6.
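As a reminder of what Kappa captures, it can be computed by hand from a confusion matrix. A minimal base-R sketch with toy counts (not from this dataset):

```r
# Cohen's kappa from a 2x2 confusion matrix (rows = predicted, cols = observed).
# Toy counts, purely illustrative.
cm <- matrix(c(700, 100,
               150, 500), nrow = 2, byrow = TRUE)
n          <- sum(cm)
p_observed <- sum(diag(cm)) / n                     # raw accuracy
p_chance   <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa      <- (p_observed - p_chance) / (1 - p_chance)
round(kappa, 3)
```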
It is important to note that sampsize is the parameter that allows us to achieve class balance. As seen above, the balance of the values in the target variable violence could be better, so we will take it into account to further improve our model. To determine an appropriate value, we will perform repeated cross-validation with values in the range of 10 to 200; sampsize should not be set too high, to avoid overfitting.
set.seed(123)
nmin <- min(table(data2$violence))
mtry <- c(length(BIC_list))
control<-trainControl(method = "repeatedcv", number=4,
allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid', classProbs=TRUE
#, summaryFunction=twoClassSummary
)
rfgrid <- expand.grid(.mtry = mtry)
modellist <- list()
sizes <- c(10, 50, 100, 200)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(violence)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500, nodesize=20,
metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 10, 50, 100, 200
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10 0.7685353 0.7999097 0.8083183 0.8078564 0.8145307 0.8393502 0
50 0.8054054 0.8247748 0.8337865 0.8344088 0.8427473 0.8664260 0
100 0.8101266 0.8284423 0.8337865 0.8362108 0.8469531 0.8607595 0
200 0.8101266 0.8264977 0.8355916 0.8360327 0.8445548 0.8643761 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10 0.5089491 0.5685028 0.5860396 0.5839635 0.5942335 0.6512371 0
50 0.5685473 0.6080660 0.6275989 0.6305870 0.6526079 0.6987228 0
100 0.5727970 0.6154095 0.6274828 0.6345590 0.6584523 0.6827782 0
200 0.5714665 0.6112236 0.6321331 0.6344116 0.6527867 0.6900433 0
We can see that the best sampsize is between 100 and 200, since Accuracy and Kappa are the highest there, above 0.8 and 0.6 respectively. These are already acceptable scores; however, they are still increasing with sampsize, so we will tune further with values between 100 and 200.
set.seed(123)
#message("Numero mĂnimo en sampsize es ",(nmin*0.75))
nmin <- min(table(data2$violence))
mtry <- c(length(BIC_list))
control<-trainControl(method = "repeatedcv", number=4,
allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid', classProbs=TRUE
#, summaryFunction=twoClassSummary
)
rfgrid <- expand.grid(.mtry = mtry)
modellist <- list()
sizes <- c(100, 122, 150, 175, 200, 250)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(violence)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500, nodesize=20,
metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 100, 122, 150, 175, 200, 250
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
100 0.8101266 0.8284423 0.8337865 0.8362108 0.8469531 0.8607595 0
122 0.8028933 0.8243007 0.8401091 0.8372967 0.8478573 0.8610108 0
150 0.8137432 0.8249097 0.8366470 0.8359418 0.8451472 0.8607595 0
175 0.8090090 0.8243785 0.8373999 0.8357629 0.8469531 0.8625678 0
200 0.8101266 0.8264977 0.8355916 0.8360327 0.8445548 0.8643761 0
250 0.8126126 0.8230247 0.8372514 0.8348597 0.8427473 0.8625678 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
100 0.5727970 0.6154095 0.6274828 0.6345590 0.6584523 0.6827782 0
122 0.5565226 0.6153075 0.6391886 0.6370387 0.6637287 0.6849436 0
150 0.5783158 0.6167000 0.6357378 0.6341633 0.6565171 0.6837724 0
175 0.5803612 0.6131092 0.6333629 0.6338510 0.6585991 0.6883675 0
200 0.5714665 0.6112236 0.6321331 0.6344116 0.6527867 0.6900433 0
250 0.5851786 0.6065148 0.6372044 0.6324643 0.6476434 0.6893396 0
The highest Kappa is obtained with a sampsize of 122.
We will use sampsize = 122 to determine the number of trees to include in the bagging model.
ntrees <- c(50, 100, 200, 300)
modellist <- list()
for (ntree in ntrees){
set.seed(12345)
rf<- train(factor(violence)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=ntree,
nodesize=20, metric='ROC', sampsize=c(122, 122))
key <- toString(ntree)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 50, 100, 200, 300
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
50 0.8090090 0.8189609 0.8249092 0.8310673 0.8416065 0.8610108 0
100 0.8086643 0.8236373 0.8355916 0.8338673 0.8431284 0.8625678 0
200 0.8065099 0.8242210 0.8346890 0.8348586 0.8453888 0.8625678 0
300 0.8047016 0.8252065 0.8372514 0.8361236 0.8499773 0.8610108 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
50 0.5782276 0.6019154 0.6157800 0.6250364 0.6472114 0.6840476 0
100 0.5774279 0.6074544 0.6354361 0.6301302 0.6518502 0.6883675 0
200 0.5686773 0.6128163 0.6275349 0.6325122 0.6578602 0.6873894 0
300 0.5599080 0.6136666 0.6343901 0.6346832 0.6635653 0.6849436 0
We can see that Accuracy is above 0.8, which is good, but Kappa could be closer to 0.7. Therefore we will try ntree values between 200 and 300.
data3 <- data2
ntrees <- c(200, 225, 250, 275, 300, 325)
modellist <- list()
for (ntree in ntrees){
set.seed(12345)
rf<- train(factor(violence)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=ntree,
nodesize=20, metric='ROC', sampsize=c(122, 122))
key <- toString(ntree)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 200, 225, 250, 275, 300, 325
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
200 0.8065099 0.8242210 0.8346890 0.8348586 0.8453888 0.8625678 0
225 0.8083183 0.8264186 0.8348375 0.8346781 0.8453888 0.8628159 0
250 0.8119349 0.8243243 0.8372514 0.8356722 0.8467450 0.8607595 0
275 0.8047016 0.8261074 0.8381555 0.8359424 0.8490054 0.8607595 0
300 0.8047016 0.8252065 0.8372514 0.8361236 0.8499773 0.8610108 0
325 0.8028933 0.8238503 0.8372514 0.8356718 0.8487590 0.8625678 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
200 0.5686773 0.6128163 0.6275349 0.6325122 0.6578602 0.6873894 0
225 0.5720502 0.6112694 0.6309594 0.6320117 0.6548550 0.6895168 0
250 0.5762077 0.6129151 0.6349685 0.6339467 0.6586226 0.6817778 0
275 0.5612723 0.6126859 0.6378264 0.6346811 0.6632761 0.6827782 0
300 0.5599080 0.6136666 0.6343901 0.6346832 0.6635653 0.6849436 0
325 0.5565226 0.6128079 0.6349339 0.6338663 0.6640487 0.6873894 0
We can see that Kappa rises slowly for ntree above 225; ntree = 300 has the highest Accuracy and Kappa so far.
ntrees <- c(260, 300, 350, 400)
modellist <- list()
for (ntree in ntrees){
set.seed(12345)
rf<- train(factor(violence)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=ntree,
nodesize=20, metric='ROC', sampsize=c(122, 122))
key <- toString(ntree)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 260, 300, 350, 400
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
260 0.8137432 0.8256569 0.8372514 0.8358523 0.8481013 0.8610108 0
300 0.8047016 0.8252065 0.8372514 0.8361236 0.8499773 0.8610108 0
350 0.8028933 0.8252065 0.8381555 0.8358530 0.8474052 0.8643761 0
400 0.8047016 0.8260285 0.8392049 0.8366656 0.8487590 0.8643761 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
260 0.5809342 0.6136666 0.6344046 0.6341784 0.6612616 0.6849436 0
300 0.5599080 0.6136666 0.6343901 0.6346832 0.6635653 0.6849436 0
350 0.5565226 0.6136666 0.6355641 0.6340658 0.6628436 0.6919861 0
400 0.5599080 0.6120067 0.6374280 0.6358937 0.6641789 0.6910177 0
We are going to choose ntrees = 400, since it has the highest Accuracy and Kappa.
We will start with the selection of mtry, which refers to the number of variables randomly considered at each split when constructing each tree in the ensemble. Choosing the right value of mtry is important because it controls the randomness and diversity of the random forest's trees. If mtry is too low, each tree may have a limited view of the available features, which may lead to underfitting. On the other hand, if mtry is too high, the trees become more similar to each other, which weakens the variance reduction of the ensemble and increases the risk of overfitting. We will use the ntree found with bagging (400).
set.seed(12345)
rfgrid<-expand.grid(mtry=c(2, 3, 5, 7, 9, 10, 20))
rf<- train(factor(violence)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,ntree=400,nodesize=20,replace=TRUE,
importance=T)
rf
Random Forest
2215 samples
7 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (4 fold, repeated 5 times)
Summary of sample sizes: 1661, 1662, 1662, 1660, 1661, 1662, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8455133 0.6303142
3 0.8460570 0.6322639
5 0.8437070 0.6271582
7 0.8437066 0.6271099
9 0.8441569 0.6285469
10 0.8448789 0.6297731
20 0.8443376 0.6294648
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was mtry = 3.
We can see that once again Kappa is slightly above 0.6. The peak seems to be at mtry = 3, so we will run the search again around that value, also including mtry = 25 to make sure performance does not increase unexpectedly at larger values.
set.seed(12345)
rfgrid<-expand.grid(mtry=c(1.5, 2, 2.5, 3, 3.5 , 4, 25))
rf<- train(factor(violence)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,ntree=400,nodesize=20,replace=TRUE,
importance=T)
rf
Random Forest
2215 samples
7 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (4 fold, repeated 5 times)
Summary of sample sizes: 1661, 1662, 1662, 1660, 1661, 1662, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
1.5 0.8455133 0.6303142
2.0 0.8460543 0.6316632
2.5 0.8466869 0.6332744
3.0 0.8461441 0.6332059
3.5 0.8450609 0.6309832
4.0 0.8456028 0.6320671
25.0 0.8443376 0.6294648
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was mtry = 2.5.
The best mtry seems to be between 2 and 3, which is also supported by this graph:
Now we will find the appropriate sampsize for the Random Forest, using the mtry (2) and the ntree (400) obtained above.
set.seed(12345)
rfgrid<-expand.grid(.mtry=2)
modellist <- list()
sizes <- c(10, 50, 100, 200)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(violence)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=400,
nodesize=20,metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 10, 50, 100, 200
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10 0.7703436 0.7874549 0.7971135 0.8008098 0.8138273 0.8393502 0
50 0.8173599 0.8229730 0.8357401 0.8361246 0.8499773 0.8607595 0
100 0.8137432 0.8288288 0.8384477 0.8394647 0.8490054 0.8679928 0
200 0.8137432 0.8264977 0.8375451 0.8381119 0.8461191 0.8698011 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10 0.5232532 0.5584621 0.5793858 0.5834395 0.6088215 0.6520346 0
50 0.5928637 0.6084458 0.6294359 0.6356625 0.6662213 0.6873091 0
100 0.5898483 0.6204788 0.6374482 0.6426184 0.6647289 0.6997963 0
200 0.5873403 0.6152095 0.6340219 0.6398775 0.6590078 0.7038425 0
Accuracy is once again above 0.8 and Kappa above 0.6 with a sampsize of 100. We will check whether values between 100 and 200 perform better.
set.seed(12345)
rfgrid<-expand.grid(.mtry=2)
modellist <- list()
sizes <- c(100, 125 ,150, 175, 200, 225)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(violence)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=400,
nodesize=20,metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 100, 125, 150, 175, 200, 225
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
100 0.8137432 0.8288288 0.8384477 0.8394647 0.8490054 0.8679928 0
125 0.8083183 0.8261261 0.8383024 0.8391943 0.8492111 0.8716094 0
150 0.8101266 0.8264186 0.8392033 0.8394664 0.8492111 0.8698011 0
175 0.8173599 0.8267148 0.8375451 0.8399186 0.8503617 0.8716094 0
200 0.8137432 0.8264977 0.8375451 0.8381119 0.8461191 0.8698011 0
225 0.8065099 0.8281466 0.8393502 0.8369380 0.8483755 0.8643761 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
100 0.5898483 0.6204788 0.6374482 0.6426184 0.6647289 0.6997963 0
125 0.5759693 0.6138411 0.6390291 0.6424693 0.6675938 0.7074968 0
150 0.5806097 0.6181126 0.6418690 0.6429217 0.6655716 0.7019716 0
175 0.5965865 0.6221099 0.6373855 0.6444367 0.6668168 0.7084135 0
200 0.5873403 0.6152095 0.6340219 0.6398775 0.6590078 0.7038425 0
225 0.5686773 0.6170447 0.6391732 0.6376263 0.6625612 0.6939047 0
Kappa and Accuracy seem to be the highest around sampsize 175.
set.seed(12345)
rfgrid<-expand.grid(.mtry=2)
modellist <- list()
sizes <- c(165, 175, 185, 195)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(violence)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=400,
nodesize=20,metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 165, 175, 185, 195
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
165 0.8144144 0.8266365 0.8402527 0.8391059 0.8474729 0.8679928 0
175 0.8173599 0.8267148 0.8375451 0.8399186 0.8503617 0.8716094 0
185 0.8101266 0.8276173 0.8437225 0.8391933 0.8470217 0.8698011 0
195 0.8101266 0.8269490 0.8410116 0.8404588 0.8510830 0.8679928 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
165 0.5878960 0.6177602 0.6420505 0.6428462 0.6626374 0.7001998 0
175 0.5965865 0.6221099 0.6373855 0.6444367 0.6668168 0.7084135 0
185 0.5767394 0.6175152 0.6500419 0.6419618 0.6594246 0.7047692 0
195 0.5780375 0.6169350 0.6448067 0.6449222 0.6678436 0.7001998 0
The best sampsize is 195.
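A note on what this parameter does: when caret's rf method receives sampsize = c(195, 195), the vector is passed through to randomForest, where (as we understand the randomForest documentation) per-class sample sizes trigger stratified sampling, so each tree is grown on 195 observations drawn from each class. A base-R sketch of that per-tree draw, on toy labels rather than the crime data itself:

```r
# Conceptual sketch of a stratified per-tree bootstrap: draw n observations,
# with replacement, from each class before fitting the tree.
set.seed(12345)
y <- rep(c("No", "Yes"), times = c(240, 60))   # imbalanced toy labels
n <- 60
idx_no  <- sample(which(y == "No"),  n, replace = TRUE)
idx_yes <- sample(which(y == "Yes"), n, replace = TRUE)
boot <- y[c(idx_no, idx_yes)]
table(boot)   # exactly 60 "No" and 60 "Yes": a balanced per-tree sample
```

This downsampling of the majority class is why relatively small sampsize values help Kappa on our imbalanced target.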
Now we will compare bagging and random forest with our previous models.
medias9<-cruzadarfbin(data=data2,
vardep="violence",listconti= BIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
nodesize=20,mtry=length(BIC_list),ntree=400,replace=TRUE,sampsize=122)
mtry Accuracy Kappa AccuracySD KappaSD
1 7 0.8390802 0.6109187 0.01432219 0.03531606
medias9 <- as.data.frame(medias9[[1]])
medias9$modelo <- "Bagging"
medias10<-cruzadarfbin(data=data2,
vardep="violence",listconti= BIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
nodesize=20,mtry=2,ntree=400,replace=TRUE,sampsize=195)
mtry Accuracy Kappa AccuracySD KappaSD
1 2 0.8438479 0.6236974 0.01336902 0.03236201
medias10 <- as.data.frame(medias10[[1]])
medias10$modelo <- "RF"
union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8, medias9, medias10)
par(cex.axis=0.8)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
We can see that Random Forest now has the highest AUC. However, its Error Rate is still above that of the Neural Network. Bagging has the worst Error Rate and one of the lowest AUCs.
We will continue with the Gradient Boosting model, a method that combines multiple weak learners (shallow decision trees) into a robust prediction model: each new tree is fitted to the errors of the current ensemble, and the final prediction aggregates the predictions of all the individual trees.
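A toy sketch of this additive logic (a deliberate simplification, not the gbm package itself): each stage fits a weak learner to the current pseudo-residuals and adds a shrunken copy of its prediction to the running ensemble, which is exactly the role the shrinkage parameter tuned below plays.

```r
# Minimal boosting loop for squared-error loss: the "weak learner" is just a
# constant (the mean residual), so the ensemble slowly converges to mean(y).
y <- c(3, 5, 7, 9)
pred <- rep(0, length(y))
shrinkage <- 0.5
for (m in 1:20) {
  resid <- y - pred          # pseudo-residuals of the current ensemble
  weak  <- mean(resid)       # an intentionally weak learner
  pred  <- pred + shrinkage * weak
}
round(pred, 4)               # all entries have converged toward mean(y) = 6
```

In gbm the weak learners are shallow trees (interaction.depth), and a smaller shrinkage needs more trees (n.trees) to reach the same fit.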
To find the best possible model, we will take Accuracy and Kappa into account and tune four parameters: n.trees, interaction.depth, shrinkage and n.minobsinnode.
GS_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
control<-trainControl(method = "repeatedcv",number=4,repeats=3,
savePredictions = "all",classProbs=TRUE)
set.seed(23)
gbmgrid <-expand.grid(n.minobsinnode=c(5,10,15),
shrinkage=c(0.1,0.01),n.trees=c(50, 100,500,1000),
interaction.depth=c(2, 4, 6, 8))
gbm<- train(violence~.,data=data3,
method="gbm",trControl=control,
tuneGrid=gbmgrid,distribution="bernoulli",verbose=FALSE)
plot(gbm)
gbm$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
69 50 4 0.1 15
sorted_results <- gbm$results[order(-gbm$results$Accuracy), ]
print(sorted_results[1, c("Accuracy", "Kappa")])
Accuracy Kappa
69 0.8471101 0.6362075
We can see that the best model has a Kappa of 0.64 and an Accuracy of 0.85. The parameters of this model (number 69) are n.trees=50, interaction.depth=4, shrinkage=0.1 and n.minobsinnode=15. For the tuning we will choose parameters around these to see whether we can find a better model.
control<-trainControl(method = "repeatedcv",number=4,repeats=3,
savePredictions = "all",classProbs=TRUE)
set.seed(23)
gbmgrid <-expand.grid(n.minobsinnode=c(13, 15,17, 20),
shrinkage=c(0.005, 0.1, 0.25, 0.5),n.trees=c(40, 50,60),
interaction.depth=c(3, 4, 5, 6))
gbm<- train(violence~.,data=data3,
method="gbm",trControl=control,
tuneGrid=gbmgrid,distribution="bernoulli",verbose=FALSE)
plot(gbm)
gbm$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
107 50 3 0.25 20
sorted_results <- gbm$results[order(-gbm$results$Accuracy), ]
print(sorted_results[1, c("Accuracy", "Kappa")])
Accuracy Kappa
107 0.8478605 0.6393972
We see that Kappa improved slightly, so we will continue tuning.
control<-trainControl(method = "repeatedcv",number=4,repeats=3,
savePredictions = "all",classProbs=TRUE)
set.seed(23)
gbmgrid <-expand.grid(n.minobsinnode=c(15, 20, 25),
shrinkage=c(0.25, 0.2, 0.3),n.trees=c(45, 50,55),
interaction.depth=c(2, 2.5, 3, 3.5))
gbm<- train(violence~.,data=data3,
method="gbm",trControl=control,
tuneGrid=gbmgrid,distribution="bernoulli",verbose=FALSE)
plot(gbm)
gbm$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
20 50 3 0.2 15
sorted_results <- gbm$results[order(-gbm$results$Accuracy), ]
print(sorted_results[1, c("Accuracy", "Kappa")])
Accuracy Kappa
20 0.8466607 0.6361663
There is no improvement in Kappa, so we stick with the parameters of model number 107: n.minobsinnode=20, shrinkage=0.25, n.trees=50 and interaction.depth=3, which give an Accuracy of 0.85 and a Kappa of 0.64.
Let's use these parameters to see whether tuning bag.fraction (the fraction of training observations randomly drawn for each tree) improves the model further. For this parameter we will try the values 0.5 and 0.8.
set.seed(3)
gbmgrid <-expand.grid(n.minobsinnode=c(20),
shrinkage=c(0.25),n.trees=c(50),
interaction.depth=c(3))
set.seed(3)
gbm<- train(factor(violence)~.,data=data3,tuneGrid=gbmgrid,
method="gbm",trControl=control,distribution="bernoulli",
bag.fraction=1,verbose=FALSE)
gbm1<- train(factor(violence)~.,data=data3,tuneGrid=gbmgrid, method="gbm",trControl=control,distribution="bernoulli", bag.fraction=0.8,verbose=FALSE)
gbm2<- train(factor(violence)~.,data=data3,tuneGrid=gbmgrid, method="gbm",trControl=control,distribution="bernoulli", bag.fraction=0.5,verbose=FALSE)
sorted_results1 <- gbm1$results[order(-gbm1$results$Accuracy), ]
sorted_results <- gbm2$results[order(-gbm2$results$Accuracy), ]
combined_results <- rbind(sorted_results, sorted_results1)
combined_results
n.minobsinnode shrinkage n.trees interaction.depth Accuracy
1 20 0.25 50 3 0.8410858
2 20 0.25 50 3 0.8400339
Kappa AccuracySD KappaSD
1 0.6225564 0.01295425 0.02725542
2 0.6207521 0.01538043 0.03739678
We see that a bag.fraction of 0.5 or 0.8 brings no meaningful benefit in Accuracy or Kappa, so we will continue without this parameter.
medias11<-cruzadagbmbin(data=data3,
vardep="violence",listconti=BIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
n.minobsinnode=20,shrinkage=0.25,n.trees=50,
interaction.depth=3)
n.minobsinnode shrinkage n.trees interaction.depth Accuracy
1 20 0.25 50 3 0.8399839
Kappa AccuracySD KappaSD
1 0.6203054 0.01397385 0.03335217
medias11 <- as.data.frame(medias11[[1]])
medias11$modelo <- "GBM"
union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8, medias9, medias10, medias11)
par(cex.axis=0.8)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
Thanks to its boxplot, we can see that GBM's values have a high variance in AUC, and its AUC is not notably higher than those of the existing models. In terms of Error Rate it does not stand out either.
We continue with the Support Vector Machine. The goal of this model is to find the hyperplane that best separates the data into classes, so we will try three kernels, linear, polynomial and RBF (radial basis function), tuning the parameters C, degree, scale and sigma.
For the linear SVM only a suitable C needs to be found; this parameter controls the balance between bias and overfitting.
set.seed(1234)
control<-trainControl(method = "repeatedcv",number=4,repeats=5,
savePredictions = "all",classProbs=TRUE)
# Apply caret and build the model
SVMgrid <-expand.grid(C=c(0.5,1,2,5,10))
SVMlin<- train(violence~.,data=data3,
method="svmLinear",trControl=control,
tuneGrid=SVMgrid,replace=replace)
print(SVMlin$results)
C Accuracy Kappa AccuracySD KappaSD
1 0.5 0.8415272 0.6140041 0.01294488 0.03053962
2 1.0 0.8417071 0.6143691 0.01389164 0.03301550
3 2.0 0.8422487 0.6158308 0.01318269 0.03104315
4 5.0 0.8420690 0.6152620 0.01288482 0.03031820
5 10.0 0.8419781 0.6150876 0.01282557 0.03003783
We can see that all values of C surpass an Accuracy of 0.8 and a Kappa of 0.6. C=2 has the highest scores. Let's try some values around it and visualize the results.
set.seed(13)
SVMgrid<-expand.grid(C=c(1, 2.5, 3, 3.5, 4, 4.5, 5))
control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid')
SVMlin<- train(violence~.,data=data3,
method="svmLinear",trControl=control,
tuneGrid=SVMgrid,verbose=FALSE)
completo<-data.frame()
completo<-rbind(completo,SVMlin$results)
completo<-completo[order(completo$Accuracy),]
ggplot(completo, aes(x=factor(C), y=Accuracy)) +
geom_point(position=position_dodge(width=0.5),size=3)
completo<-data.frame()
completo<-rbind(completo,SVMlin$results)
completo<-completo[order(completo$Kappa),]
ggplot(completo, aes(x=factor(C), y=Kappa)) +
geom_point(position=position_dodge(width=0.5),size=3)
We can see that the best C for the linear SVM is C=3.
Apart from C, we will also search for the best sigma. Sigma controls the width of the RBF kernel: a wider kernel treats more distant points as similar, while a narrower kernel does the opposite.
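caret's svmRadial is backed by kernlab, which (to our knowledge) parameterizes the RBF kernel as k(x, y) = exp(-sigma * ||x - y||^2); a small sketch of how sigma controls the width:

```r
# RBF kernel in kernlab's parameterization: larger sigma = narrower kernel,
# so similarity between distant points decays faster.
rbf <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))
x <- c(0, 0); far <- c(3, 0)
rbf(x, x,   sigma = 0.07)    # identical points: similarity is exactly 1
rbf(x, far, sigma = 0.07)    # wide kernel: distant points stay fairly similar
rbf(x, far, sigma = 0.7)     # narrow kernel: similarity collapses
```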
set.seed(123456)
SVMgrid<-expand.grid(C=c(0.0001,0.5,1,2,5,10),
sigma=c(0.0001,0.005,0.01,0.05,0.7))
control<-trainControl(method = "cv",
number=4,savePredictions = "all")
SVMRBF<- train(violence~.,data=data3,
method="svmRadial",trControl=control,
tuneGrid=SVMgrid,verbose=FALSE)
sorted_results <- SVMRBF$results[order(-SVMRBF$results$Accuracy), ]
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "sigma")])
Accuracy Kappa C sigma
27 0.8483091 0.6369888 10 0.005
18 0.8474082 0.6340350 2 0.010
22 0.8474058 0.6338581 5 0.005
All three top models have an acceptable Accuracy and Kappa. Let's see whether better values of C and sigma lie close to the parameters of these three models, and visualize the results.
set.seed(1234)
SVMgrid<-expand.grid(C=c(5, 7, 10, 12, 15),
sigma=c(0.001, 0.005, 0.007, 0.01, 0.03, 0.05, 0.07))
control<-trainControl(method = "repeatedcv",
number=4,savePredictions = "all")
SVMRBF <- train(violence~.,data=data3,
method="svmRadial",trControl=control,
tuneGrid=SVMgrid,verbose=FALSE)
sorted_results <- SVMRBF$results[order(-SVMRBF$results$Accuracy), ]
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "sigma")])
Accuracy Kappa C sigma
14 0.8487481 0.6353723 7 0.07
34 0.8478497 0.6325623 15 0.05
27 0.8473952 0.6322370 12 0.05
completo<-data.frame()
completo<-rbind(completo,SVMRBF$results)
completo<-completo[order(completo$Kappa),]
ggplot(completo, aes(x=factor(C), y=Kappa,
color=factor(sigma))) +
geom_point(position=position_dodge(width=0.5),size=3)
completo<-data.frame()
completo<-rbind(completo,SVMRBF$results)
completo<-completo[order(completo$Accuracy),]
ggplot(completo, aes(x=factor(C), y=Accuracy,
color=factor(sigma))) +
geom_point(position=position_dodge(width=0.5),size=3)
In terms of Accuracy and Kappa, the best sigma is 0.07 and the best C is 7.7.
For the polynomial kernel we will tune degree and scale in addition to C. Degree controls the flexibility and shape of the decision boundary separating the classes: a degree that is too high can lead to overfitting, as the curve becomes more complex.
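Under the hood, caret's svmPoly uses kernlab's polynomial kernel, k(x, y) = (scale * &lt;x, y&gt; + offset)^degree; the offset is not tuned here, and we assume kernlab's default of 1. A quick numeric sketch:

```r
# Polynomial kernel: degree bends the decision boundary, scale rescales the
# inner product before exponentiation.
poly_k <- function(x, y, scale, degree, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}
x <- c(1, 2); y <- c(3, 4)               # inner product <x, y> = 11
poly_k(x, y, scale = 0.5, degree = 3)    # (0.5 * 11 + 1)^3 = 274.625
poly_k(x, y, scale = 0.5, degree = 2)    # lower degree: a stiffer boundary
```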
set.seed(12345)
control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE,
savePredictions = "all",search = 'grid')
SVMgrid <-expand.grid(C=c(0.1, 0.3, 0.5, 0.7, 0.9),degree=c(2,3,4),scale=c(1, 2, 3))
SVM_poli<- train(violence~.,data=data3,
method="svmPoly",trControl=control,
tuneGrid=SVMgrid,replace=replace)
sorted_results <- SVM_poli$results[order(-SVM_poli$results$Accuracy), ]
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "degree", "scale")])
Accuracy Kappa C degree scale
2 0.8415434 0.6183269 0.1 2 2
4 0.8415401 0.6208347 0.1 3 1
3 0.8410946 0.6174673 0.1 2 3
We can see that model 4 seems to be the best. Let's see if we can find a higher Accuracy and Kappa with similar parameters.
set.seed(12345)
control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE,
savePredictions = "all",search = 'grid')
SVMgrid <-expand.grid(C=c(0.05, 0.1, 0.15, 0.2),degree=c(1, 1.5, 2, 2.5,3),scale=c(0.5, 0, 1, 1.5, 2, 2.5, 3))
SVM_poli<- train(violence~.,data=data3,
method="svmPoly",trControl=control,
tuneGrid=SVMgrid,replace=replace)
sorted_results <- SVM_poli$results[order(-SVM_poli$results$Accuracy), ]
# Print the top 3 rows with highest accuracy and kappa values
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "degree", "scale")])
Accuracy Kappa C degree scale
65 0.8474123 0.6337275 0.10 3 0.5
30 0.8474123 0.6320819 0.05 3 0.5
100 0.8456048 0.6299748 0.15 3 0.5
set.seed(12345)
completo<-data.frame()
completo<-rbind(completo,SVM_poli$results)
completo<-completo[order(completo$Accuracy),]
plot1 <- ggplot(completo, aes(x=factor(C), y=Accuracy,
color=factor(scale),pch=factor(degree))) +
geom_point(position=position_dodge(width=0.5),size=3)
plot1
completo<-data.frame()
completo<-rbind(completo,SVM_poli$results)
completo<-completo[order(completo$Kappa),]
plot1 <- ggplot(completo, aes(x=factor(C), y=Kappa,
color=factor(scale),pch=factor(degree))) +
geom_point(position=position_dodge(width=0.5),size=3)
plot1
For the polynomial SVM we choose the parameters C=0.048, scale=0.5 and degree=3.
medias12<-cruzadaSVMbin(data=data3,
vardep="violence",listconti= BIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,C=3)
C Accuracy Kappa AccuracySD KappaSD
1 3 0.8417545 0.6138962 0.01315963 0.03280717
medias12 <- as.data.frame(medias12[[1]])
medias12$modelo="SVM_linear"
medias13<-cruzadaSVMbinRBF(data=data3,
vardep="violence",listconti= BIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
C=7.7,sigma=0.07)
C sigma Accuracy Kappa AccuracySD KappaSD
1 7.7 0.07 0.8442634 0.6187234 0.01328779 0.03413729
medias13 <- as.data.frame(medias13[[1]])
medias13$modelo="SVM_RBF"
medias14<- cruzadaSVMbinPoly(data=data3,
vardep="violence",listconti= BIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
C=0.048,degree=3,scale=0.5)
C degree scale Accuracy Kappa AccuracySD KappaSD
1 0.048 3 0.5 0.8300884 0.5663375 0.019856 0.06594605
medias14 <- as.data.frame(medias14[[1]])
medias14$modelo="SVM_Poli"
union1<-rbind(medias1, medias2, medias3, medias4, medias5, medias6, medias7, medias8, medias9, medias10, medias11, medias12,
medias13,
medias14)
par(cex.axis=0.4)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
Unfortunately, SVM RBF and SVM Poly have by far the worst AUC scores, and SVM Poly also has the highest Error Rate, with high variance. The best SVM model is the linear one.
As one of the last steps, we will build an ensemble to combine our models. Having already tuned a variety of models, we will use the best five algorithms (Neural network, Random Forest, SES, BIC and AIC). SES and MXM selected exactly the same variables and perform identically, so we can only choose one of them, in this case SES. We will include cross-validated logistic regressions based on the BIC variables and the AIC variables.
vardep<-"violence"
listconti<-c("racePctWhite", "TotalPctDiv", "blackPerCap",
"PctSpeakEnglOnly", "LemasPctOfficDrugUn", "pctWWage", "PctSameHouse85"
)
variables2<-c("racePctWhite", "TotalPctDiv", "blackPerCap",
"PctSpeakEnglOnly", "LemasPctOfficDrugUn", "pctWWage", "PctSameHouse85"
)
listconti2 <- c("racePctWhite", "TotalPctDiv", "blackPerCap",
"PctSpeakEnglOnly", "LemasPctOfficDrugUn", "pctWWage", "PctSameHouse85",
"indianPerCap", "racepctblack", "LandArea", "LemasPctPolicOnPatr",
"PolicReqPerOffic")
listconti3 <- c("blackPerCap", "MedRentPctHousInc", "racePctWhite", "LemasPctOfficDrugUn",
"TotalPctDiv", "PctWorkMom")
data4<-data2[,c(variables2,vardep)]
listclass<-c("")
grupos<-4
sinicio<-1234
repe<-50
set.seed(1234)
archivo_v$violence<-ifelse(archivo_v$violence==1,"Yes","No")
archivo_v <- na.omit(archivo_v)
archivo_v <-as_tibble(archivo_v)
archivo_v <- as.data.frame(archivo_v)
dput(names(archivo_v))
c("violence", "perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked",
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv",
"PersPerFam", "PctWorkMom")
medias1bis<-cruzadalogistica(data=data3,
vardep=vardep,listconti=listconti,
listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe)
predi1<-as.data.frame(medias1bis[2])
predi1$logi<-predi1$Yes
medias2bis<-cruzadaavnnetbin(data=data3,
vardep=vardep,listconti=listconti,
listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe,repeticiones=25,itera=40,
size=c(8),decay=c(0.15))
size decay bag Accuracy Kappa AccuracySD KappaSD
1 8 0.15 FALSE 0.8494528 0.6438057 0.01220168 0.02844946
predi2<-as.data.frame(medias2bis[2])
predi2$network<-predi2$Yes
medias3bis<-cruzadalogistica(data=crime_v_final2,
vardep=vardep,listconti=listconti3,
listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe)
predi3<-as.data.frame(medias3bis[2])
predi3$SES<-predi3$Yes
medias4bis<-cruzadarfbin(data=data3,
vardep=vardep,listconti=listconti,
listclass=listclass,grupos=grupos,
sinicio=sinicio,repe=repe,nodesize=20,mtry=2,ntree=400,replace=TRUE,sampsize=300)
mtry Accuracy Kappa AccuracySD KappaSD
1 2 0.8446764 0.6260848 0.01344127 0.03261371
predi4<-as.data.frame(medias4bis[2])
predi4$RF<-predi4$Yes
medias5bis<-cruzadalogistica(data=crime_v_final2,
vardep=vardep,listconti=listconti2,
listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe)
predi5<-as.data.frame(medias5bis[2])
predi5$AIC<-predi5$Yes
set.seed(123)
unipredi<-cbind(predi1, predi2, predi3, predi4, predi5)
#We eliminate duplicate columns
unipredi<- unipredi[, !duplicated(colnames(unipredi))]
#Combinations of 2
unipredi$predi6<-(unipredi$RF+unipredi$network)/2
unipredi$predi7<-(unipredi$RF+unipredi$SES)/2
unipredi$predi8<-(unipredi$RF+unipredi$AIC)/2
unipredi$predi9<-(unipredi$RF+unipredi$logi)/2
unipredi$predi10<-(unipredi$logi+unipredi$network)/2
unipredi$predi11<-(unipredi$logi+unipredi$SES)/2
unipredi$predi12<-(unipredi$network+unipredi$SES)/2
unipredi$predi13<-(unipredi$network+unipredi$AIC)/2
unipredi$predi14<-(unipredi$SES+unipredi$AIC)/2
unipredi$predi15<-(unipredi$logi+unipredi$AIC)/2
#Combinations of 3
unipredi$predi16<-(unipredi$SES+unipredi$network+unipredi$logi)/3
unipredi$predi17<-(unipredi$SES+unipredi$RF+unipredi$logi)/3
unipredi$predi18<-(unipredi$SES+unipredi$network+unipredi$RF)/3
unipredi$predi19<-(unipredi$logi+unipredi$RF+unipredi$network)/3
unipredi$predi20<-(unipredi$RF+unipredi$network+unipredi$AIC)/3
unipredi$predi21<-(unipredi$logi+unipredi$RF+unipredi$AIC)/3
unipredi$predi22<-(unipredi$AIC+unipredi$RF+unipredi$SES)/3
unipredi$predi23<-(unipredi$logi+unipredi$AIC+unipredi$network)/3
unipredi$predi24<-(unipredi$logi+unipredi$AIC+unipredi$SES)/3
unipredi$predi25<-(unipredi$SES+unipredi$AIC+unipredi$network)/3
#Combinations of 4
unipredi$predi26<-(unipredi$RF+unipredi$network+unipredi$AIC+unipredi$logi)/4
unipredi$predi27<-(unipredi$RF+unipredi$AIC+unipredi$logi+unipredi$SES)/4
unipredi$predi28<-(unipredi$network+unipredi$AIC+unipredi$logi+unipredi$SES)/4
unipredi$predi29<-(unipredi$RF+unipredi$network+unipredi$logi+unipredi$SES)/4
unipredi$predi30<-(unipredi$RF+unipredi$network+unipredi$AIC+unipredi$SES)/4
#All 5
unipredi$predi31<-(unipredi$SES+unipredi$network+unipredi$logi+unipredi$AIC+unipredi$RF)/5
c("pred", "obs", "No", "Yes", "rowIndex", "parameter", "Resample",
"Fold", "Rep", "logi", "size", "decay", "bag", "network", "SES",
"tasa", "auc", "mtry", "RF", "AIC", "predi6", "predi7", "predi8",
"predi9", "predi10", "predi11", "predi12", "predi13", "predi14",
"predi15", "predi16", "predi17", "predi18", "predi19", "predi20",
"predi21", "predi22", "predi23", "predi24", "predi25", "predi26",
"predi27", "predi28", "predi29", "predi30", "predi31")
listado<-c("logi", "network", "AIC", "RF","SES", "predi6", "predi7",
"predi8", "predi9", "predi10", "predi11", "predi12", "predi13", "predi14", "predi15", "predi16", "predi17", "predi18", "predi19", "predi20"
, "predi21", "predi22", "predi23", "predi24","predi25", "predi26", "predi28", "predi29", "predi30","predi31")
tasafallos<-function(x,y) {
confu<-confusionMatrix(x,y)
tasa<-confu[[3]][1]
return(tasa)
}
auc<-function(x,y) {
curvaroc<-roc(response=x,predictor=y)
auc<-curvaroc$auc
return(auc)
}
repeticiones<-nlevels(factor(unipredi$Rep))
unipredi$Rep<-as.factor(unipredi$Rep)
unipredi$Rep<-as.numeric(unipredi$Rep)
medias0<-data.frame(c())
for (prediccion in listado)
{
unipredi$proba<-unipredi[,prediccion]
unipredi[,prediccion]<-ifelse(unipredi[,prediccion]>0.5,"Yes","No")
for (repe in 1:repeticiones)
{
paso <- unipredi[(unipredi$Rep==repe),]
pre<-factor(paso[,prediccion])
archi<-paso[,c("proba","obs")]
archi<-archi[order(archi$proba),]
obs<-paso[,c("obs")]
tasa=1-tasafallos(pre,obs)
t<-as.data.frame(tasa)
t$modelo<-prediccion
auc<-suppressMessages(auc(archi$obs,archi$proba))
t$auc<-auc
medias0<-rbind(medias0,t)
}
}
We display the sorted boxplots to see the Error Rate and AUC.
set.seed(1233)
medias0$modelo <- with(medias0,
reorder(modelo,tasa, mean))
par(cex.axis=0.7,las=2)
boxplot(data=medias0,tasa~modelo,col="red", main='Error Rate')
tablamedias2<-medias0 %>%
group_by(modelo) %>%
summarize(auc=mean(auc))
tablamedias2<-tablamedias2[order(-tablamedias2$auc),]
medias0$modelo <- with(medias0,
reorder(modelo,auc, mean))
par(cex.axis=0.7,las=2)
boxplot(data=medias0,auc~modelo,col="blue", main='AUC')
We can see that the top 5 models by AUC differ from the top 5 by Error Rate. However, in both cases the top 5 consist entirely of ensembles, proving that combining our models is advantageous.
Top 5 AUC: predi8, predi20, predi7, predi22, predi18. Among these, predi8 has the lowest Error Rate, ranking sixth overall on that measure.
Top 5 Error Rate: predi26, predi13, predi21, predi10, predi23. Among these, predi26 has the highest AUC.
Random Forest is present in every one of the top 5 AUC ensembles, while AIC appears in four of the five best Error Rate ensembles. We will therefore take a closer look at predi8 and predi26.
Top 5 AUC:
| Rank | Ensemble | Models included |
|---|---|---|
| 1 | Predi8 | RF, AIC |
| 2 | Predi20 | RF, Network, AIC |
| 3 | Predi7 | RF, SES |
| 4 | Predi22 | RF, AIC, SES |
| 5 | Predi18 | RF, Network, SES |
Top 5 Error Rate:
| Rank | Ensemble | Models included |
|---|---|---|
| 1 | Predi26 | RF, Network, AIC, Logi |
| 2 | Predi13 | Network, AIC |
| 3 | Predi21 | RF, AIC, logi |
| 4 | Predi10 | Network, logi |
| 5 | Predi23 | Network, AIC, logi |
We will run hypothesis tests between the best AUC model and the best Error Rate model to decide which one to choose.
listamodelos<-c("predi8","predi26")
datacontraste<-medias0[which(medias0$modelo%in%listamodelos),]
# Error Rates
res <- t.test(datacontraste$tasa ~datacontraste$modelo)
res
Welch Two Sample t-test
data: datacontraste$tasa by datacontraste$modelo
t = -3.3171, df = 94.575, p-value = 0.001292
alternative hypothesis: true difference in means between group predi26 and group predi8 is not equal to 0
95 percent confidence interval:
-0.0018619297 -0.0004676414
sample estimates:
mean in group predi26 mean in group predi8
0.1476930 0.1488578
# AUC
res <- t.test(datacontraste$auc ~datacontraste$modelo)
res
Welch Two Sample t-test
data: datacontraste$auc by datacontraste$modelo
t = -8.9707, df = 97.431, p-value = 2.163e-14
alternative hypothesis: true difference in means between group predi26 and group predi8 is not equal to 0
95 percent confidence interval:
-0.0014994389 -0.0009561763
sample estimates:
mean in group predi26 mean in group predi8
0.9071112 0.9083390
In both cases the p-value is below 0.05, so the differences in Error Rate and AUC are significantly different from zero. Predi26 is significantly better than predi8 in Error Rate, while predi8 is significantly stronger in terms of AUC. However, the difference in AUC is larger and far more significant, indicating that predi8 is the better choice. This is not surprising considering that predi8 is only sixth place in Error Rate, while predi26 is ninth place in AUC.
| Measure | Predi8 | Predi26 | p-value |
|---|---|---|---|
| AUC | 0.9083 | 0.9071 | 2.163e-14 |
| Error Rate | 0.1489 | 0.1477 | 0.00129 |
The first step of our conclusions is to construct the confusion matrix based on our model predi8. The confusion matrix summarizes the agreement between the model's predictions and the actual values. We use predi8 because it is the best of the ensembles we developed. We will use repeated CV with six repetitions.
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 8422 1120
Yes 752 2996
Accuracy : 0.8591
95% CI : (0.8531, 0.865)
No Information Rate : 0.6903
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6622
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7279
Specificity : 0.9180
Pos Pred Value : 0.7994
Neg Pred Value : 0.8826
Prevalence : 0.3097
Detection Rate : 0.2254
Detection Prevalence : 0.2820
Balanced Accuracy : 0.8230
'Positive' Class : Yes
| Prediction/Reference | No | Yes |
|---|---|---|
| No | 8422 | 1120 |
| Yes | 752 | 2996 |
| Measure | Score |
|---|---|
| Accuracy | 0.8591 |
| Kappa | 0.6622 |
| Sensitivity | 0.7279 |
| Specificity | 0.9180 |
The numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) can be read from the matrix. Note that, because of the six repetitions performed, the total number of observations is six times the size of the original dataset (2,215), which allows for a more reliable interpretation.
TP is the number of positive data points the model classified correctly: the selected algorithm predicts the positive class (above-average violence rate) correctly 72.8% of the time, a rate referred to as Sensitivity. TN indicates how many data points of the negative class were correctly predicted, here 91.8%, a rate referred to as Specificity. The overall Accuracy of the model is 85.91%, which in the context of the data can be considered strong.
That negative cases are classified correctly more often than positive ones comes as no surprise, recalling that the majority class is 0. To examine whether a higher Sensitivity is possible, another confusion matrix will be created with an explicitly lowered classification threshold, which increases the probability of classifying positives correctly.
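These rates can be recomputed directly from the pooled matrix above (TP = 2996, FN = 1120, FP = 752, TN = 8422, with "Yes" as the positive class), which also makes the definition of Kappa explicit:

```r
# Recomputing the reported statistics from the pooled confusion matrix.
TP <- 2996; FN <- 1120; FP <- 752; TN <- 8422
n  <- TP + FN + FP + TN                       # 6 x 2215 = 13290
sens <- TP / (TP + FN)                        # Sensitivity: 0.7279
spec <- TN / (TN + FP)                        # Specificity: 0.9180
acc  <- (TP + TN) / n                         # Accuracy:    0.8591
# Cohen's Kappa compares observed agreement with chance agreement
pe    <- ((TN + FN) * (TN + FP) + (TP + FN) * (TP + FP)) / n^2
kappa <- (acc - pe) / (1 - pe)                # Kappa:       0.6622
round(c(sens = sens, spec = spec, acc = acc, kappa = kappa), 4)
```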
library(randomForest)
library(caret)
library(dplyr)
# Define parameters
n_repeats <- 6
set.seed(12345)
all_predictions <- list()
all_actuals <- list()
for (i in 1:n_repeats) {
rfModel <- randomForest(as.factor(violence) ~ ., data = data2,
ntree = 400, mtry = 2, nodesize = 20,
replace = TRUE, sampsize = 195)
rfPred <- predict(rfModel, newdata = data2, type = "prob")[, "Yes"]
logitModel <- glm(factor(violence) ~ ., data = crime_v_final2, family = "binomial")
logitPred <- predict(logitModel, newdata = crime_v_final2, type = "response")
ensemblePred <- (rfPred + logitPred) / 2
finalPred <- ifelse(ensemblePred > 0.3, "Yes", "No")
finalPred <- factor(finalPred, levels = c("No", "Yes"))
all_predictions[[i]] <- finalPred
all_actuals[[i]] <- factor(data2$violence, levels = c("No", "Yes"))
}
combined_predictions <- bind_rows(lapply(1:n_repeats, function(i) {
data.frame(pred = all_predictions[[i]], obs = all_actuals[[i]])
}))
cm <- confusionMatrix(reference = combined_predictions$obs, data = combined_predictions$pred, positive = "Yes")
print(cm)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 7618 558
Yes 1556 3558
Accuracy : 0.8409
95% CI : (0.8346, 0.8471)
No Information Rate : 0.6903
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6513
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8644
Specificity : 0.8304
Pos Pred Value : 0.6957
Neg Pred Value : 0.9318
Prevalence : 0.3097
Detection Rate : 0.2677
Detection Prevalence : 0.3848
Balanced Accuracy : 0.8474
'Positive' Class : Yes
| Prediction/Reference | No | Yes |
|---|---|---|
| No | 7618 | 558 |
| Yes | 1556 | 3558 |
| Measure | Score |
|---|---|
| Accuracy | 0.8409 |
| Kappa | 0.6513 |
| Sensitivity | 0.8644 |
| Specificity | 0.8304 |
With a threshold of 0.3 the Sensitivity rises significantly, at the cost of Specificity. Nevertheless, both rates remain above 0.8 and therefore indicate strong predictive capability. Accuracy declines as well but stays within an acceptable range, and Kappa drops only slightly. A threshold of 0.3 is therefore recommended for this model.
set.seed(1234)
data3$violence <- as.factor(data3$violence)
data3$violence <- relevel(data3$violence,ref="No")
var <- c("racePctWhite", "TotalPctDiv", "blackPerCap",
"PctSpeakEnglOnly", "LemasPctOfficDrugUn", "pctWWage", "PctSameHouse85"
)
vardep <- "violence"
data4<-data3[,c(var,"violence")]
logistica<-glm(factor(violence)~.,data=data4,family="binomial")
summary(logistica)
Call:
glm(formula = factor(violence) ~ ., family = "binomial", data = data4)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.1127 -0.5548 -0.2486 0.4619 3.1920
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.39208 0.07982 -17.440 < 2e-16 ***
racePctWhite -1.09412 0.08255 -13.254 < 2e-16 ***
TotalPctDiv 0.86098 0.08590 10.023 < 2e-16 ***
blackPerCap -0.63029 0.13923 -4.527 5.98e-06 ***
PctSpeakEnglOnly -0.38372 0.06685 -5.740 9.46e-09 ***
LemasPctOfficDrugUn 0.22794 0.06456 3.531 0.000414 ***
pctWWage -0.36793 0.07561 -4.866 1.14e-06 ***
PctSameHouse85 -0.30474 0.07657 -3.980 6.90e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2741.6 on 2214 degrees of freedom
Residual deviance: 1637.0 on 2207 degrees of freedom
AIC: 1653
Number of Fisher Scoring iterations: 6
| Variable | Coefficient |
|---|---|
| racePctWhite | -1.09412 |
| TotalPctDiv | 0.86098 |
| blackPerCap | -0.63029 |
| PctSpeakEnglOnly | -0.38372 |
| LemasPctOfficDrugUn | 0.22794 |
| pctWWage | -0.36793 |
| PctSameHouse85 | -0.30474 |
As can be seen in this table, TotalPctDiv has the highest positive coefficient, 0.86: a community with a higher divorce rate is more likely to have a violence rate above the national average. In contrast, blackPerCap (per capita income for Blacks) and racePctWhite (percentage of Whites) have negative coefficients: a high number of well-earning African-Americans and a high percentage of Caucasians within the community lower the probability of a violent crime rate above the national average. These three variables have the largest impact in absolute terms on a community's violence occurrences, but racePctWhite is the only one whose coefficient exceeds 1 in absolute value and therefore has a very strong influence.
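Because logistic regression coefficients are log-odds, exponentiating them yields odds ratios, which are often easier to communicate to decision-makers. A quick sketch using the coefficients from the table (the exact meaning of "one unit" depends on how the predictors were scaled during depuration):

```r
# Odds ratios from the fitted log-odds coefficients: a one-unit increase in a
# predictor multiplies the odds of above-average violence by exp(beta).
coefs <- c(racePctWhite = -1.09412, TotalPctDiv = 0.86098,
           blackPerCap  = -0.63029)
round(exp(coefs), 3)
# TotalPctDiv ~ 2.365: higher divorce rates more than double the odds;
# racePctWhite ~ 0.335: the odds drop to about a third.
```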
To also offer decision-makers an easily interpretable analysis, a decision tree will be built. It will also show which variables are regarded as important and how they overlap with the ones above. When creating the decision tree, its parameters are chosen so that at least 10 observations must fall in every leaf (minbucket=10), which prevents small splits at node level, an overly deep tree and overfitting. Furthermore, to guarantee the most effective splits by decreasing the impurity of the corresponding child nodes, the Gini impurity criterion will be used.
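The Gini impurity that drives these splits is 1 - sum(p_k^2) over the class proportions in a node, and each split is chosen to reduce it as much as possible. For instance, at the root roughly 69% of communities are 'No' (the No Information Rate reported above was 0.6903):

```r
# Gini impurity: 0 for a pure node, maximal (0.5 for two classes) at 50/50.
gini <- function(p) 1 - sum(p^2)
gini(c(0.69, 0.31))   # impure root node: 0.4278
gini(c(0.95, 0.05))   # a nearly pure leaf after good splits: 0.095
gini(c(1, 0))         # perfectly pure: 0
```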
# Simple tree
library(rpart)
library(rpart.plot)
set.seed(123)
arbol1 <- rpart(violence ~ ., data = data3,
minbucket =10,method = "class",parms=list(split="gini"),cp=0)
rpart.plot(arbol1,extra=105,nn=TRUE, tweak=1.2)
The variables selected are racePctWhite, blackPerCap, TotalPctDiv, PctSpeakEnglOnly, PctSameHouse85, LemasPctOfficDrugUn and pctWWage, which corresponds exactly to the set selected by BIC.
| Model | AUC | Error Rate |
|---|---|---|
| Random Forest | 0.9055 | 0.1553 |
| Logi | 0.9020 | 0.1547 |
| Neural Network | 0.9035 | 0.1505 |
| Predi8 | 0.9083 | 0.1489 |
| Predi26 | 0.9071 | 0.1477 |
Looking at this final table, which compares the AUC and Error Rate of the best ensemble models and the simpler models developed throughout this work, we can see that the best model by AUC is predi8, a combination of the RF and AIC models, while predi26 has the lowest Error Rate. Nevertheless, after the hypothesis testing we conclude that it is better to stay with predi8, since the difference in AUC between the two models is larger and more significant than the difference in Error Rate.